Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words

Authors

  • Nobuhiro Kaji
  • Masaru Kitsuregawa
Abstract

In a number of languages, unlike English, word boundaries within noun compounds are not marked by white space, and splitting such compounds benefits various NLP applications. In Japanese, noun compounds made up of katakana words (i.e., transliterated foreign words) are particularly difficult to split, because katakana words are highly productive and are often out-of-vocabulary. To overcome this difficulty, we propose using monolingual and bilingual paraphrases of katakana noun compounds to identify word boundaries. Experiments demonstrated that splitting accuracy is substantially improved by extracting such paraphrases from unlabeled textual data, the Web in our case, and then using that information to construct splitting models.
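
As a rough illustration of the idea of using paraphrases as boundary evidence, the sketch below scores each candidate split of a compound by how often a delimited variant of it appears in unlabeled text. It is not the authors' splitting model. The abstract does not say what form the paraphrases take, so joining the parts with a middle dot ("・") is an assumption, and the function names and the toy counts are hypothetical.

```python
from itertools import combinations


def candidate_splits(compound, max_parts=3):
    """Yield every way to cut `compound` into 1..max_parts contiguous pieces."""
    n = len(compound)
    for k in range(max_parts):  # k is the number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [compound[i:j] for i, j in zip(bounds, bounds[1:])]


def split_with_paraphrases(compound, paraphrase_counts, separator="・"):
    """Pick the split whose delimited paraphrase is observed most often.

    `paraphrase_counts` maps a delimited string to its frequency in some
    unlabeled corpus (hypothetical data here). If no paraphrase is
    observed, the compound is returned unsplit.
    """
    best, best_count = [compound], 0
    for parts in candidate_splits(compound):
        if len(parts) == 1:
            continue  # the unsplit form is the fallback, not a candidate
        count = paraphrase_counts.get(separator.join(parts), 0)
        if count > best_count:
            best, best_count = parts, count
    return best


# Toy counts, invented for illustration only.
toy_counts = {"アンチョビ・パスタ": 12}
print(split_with_paraphrases("アンチョビパスタ", toy_counts))
```

A real system would presumably feed such evidence into a trained model as features rather than take the single most frequent paraphrase, and would need to handle bilingual evidence as well; this sketch covers only the monolingual lookup step.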

Similar resources

Extracting French-Japanese Word Pairs from Bilingual Corpora based on Transliteration Rules

It has been shown so far that using transliteration rules to extract Japanese Katakana and English word pairs is highly useful and promising. But for Japanese-French pairs, the method is not guaranteed to work, because only a very few Japanese Katakana words are borrowed directly from French. In this paper we will show the possibility of extracting Japanese Katakana and French word pairs based ...

Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs

This paper describes a method of extracting katakana words and phrases, along with their English counterparts from non-aligned monolingual web search engine query logs. The method employs a trainable edit distance function to find pairs that have a high probability of being equivalent. These pairs can then be used to further bootstrap training of the edit distance function, ...

Comparing and Extracting Paraphrasing Words with 2-Way Bilingual Dictionaries

We analyze a variety of lexical expressions with 2-way bilingual dictionaries and propose a method for extracting paraphrasing words. First, we compare the coverage between an English-Japanese dictionary and a Japanese-English dictionary from the viewpoint of the returnability of the words by translating English to Japanese, and then back to English again. The variety is shown using examples. N...

Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus

Katakana, the Japanese phonograms mainly used for loan words, is a troublemaker in Japanese word segmentation. Since Katakana words are heavily domain-dependent and there are many Katakana neologisms, it is almost impossible to construct and maintain a Katakana word dictionary by hand. This paper proposes an automatic segmentation method of Japanese Katakana compounds, which makes it possible to cons...

Automatic Extraction of Translational Japanese-KATAKANA and English Word Pairs

A method for automatically extracting translational Japanese-KATAKANA and English word pairs from bilingual corpora is proposed. The method applies all the existing transliteration rules to each mora unit in a KATAKANA word, and extracts as the translation an English word that matches or partially matches one of these transliteration candidates. For instance, if there is a word ‘グラフ’ (graph) in Japan...

Publication date: 2011